Udacity Data Analyst Nanodegree - P4: Explore and summarize data
by Gabor Galgocz
6 March 2016

========================================================

Exploring the dataset

Source

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009. ISSN: 0167-9236.

Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016 [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

Understanding the purpose of the dataset

Before exploring the data, it is worth looking at where the dataset comes from and why it was created. Its purpose was to explore the effect that different chemical properties may have on the quality of the white variant of the Portuguese wine type called “Vinho Verde”. Because every observation is a wine of this one type, we shouldn’t assume that the correlations found in the dataset apply to any other type of white wine. Vinho Verde has a characteristic flavour, and the wine experts were most probably looking for that flavour when they evaluated quality. Other white wine types may show different correlations between their chemical characteristics and their perceived quality. More information on the Vinho Verde wine type: https://en.wikipedia.org/wiki/Vinho_Verde

The input variables are 11 chemical parameters of the tested wines; the output variable is the quality of the wine, as evaluated by at least 3 wine experts.

Univariate Plots Section

Let’s load the dataset and look at its variables, data types, structure and summary.

The variables

For a detailed description of the variables, we can check the original description of the dataset: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"

Attribute information

For more information, read [Cortez et al., 2009].

Input variables (based on physicochemical tests):
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
Output variable (based on sensory data):
12 - quality (score between 0 and 10)

Description of the attributes

1 - fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

Output variable (based on sensory data): 12 - quality (score between 0 and 10)

What is the structure of your dataset?

## 'data.frame':    4898 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...

The dataset contains 4898 observations of 13 columns: a row index (X), the 11 input variables and the output variable, quality. The details and descriptions of the variables can be found above. The input variables are numeric, while X and quality are integers. In some plots I will treat quality as a factor variable, which suits the charts better.
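As a minimal sketch of the loading step, the snippet below rebuilds the first five rows from the structure output above, writes them to a temporary CSV file and reads them back with read.csv. The file name wineQualityWhites.csv mentioned in the comment is an assumption (the report does not name the file), and the density values are only as precise as the rounded numbers printed by str().

```r
# First five rows, transcribed from the str() output above.
wine_sample <- data.frame(
  X                    = 1:5,
  fixed.acidity        = c(7.0, 6.3, 8.1, 7.2, 7.2),
  volatile.acidity     = c(0.27, 0.30, 0.28, 0.23, 0.23),
  citric.acid          = c(0.36, 0.34, 0.40, 0.32, 0.32),
  residual.sugar       = c(20.7, 1.6, 6.9, 8.5, 8.5),
  chlorides            = c(0.045, 0.049, 0.050, 0.058, 0.058),
  free.sulfur.dioxide  = c(45, 14, 30, 47, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186, 186),
  density              = c(1.001, 0.994, 0.995, 0.996, 0.996),
  pH                   = c(3.00, 3.30, 3.26, 3.19, 3.19),
  sulphates            = c(0.45, 0.49, 0.44, 0.40, 0.40),
  alcohol              = c(8.8, 9.5, 10.1, 9.9, 9.9),
  quality              = c(6L, 6L, 6L, 6L, 6L)
)

# Round-trip through a temporary CSV file to mimic loading from disk.
# With the full data this would be something like
# read.csv("wineQualityWhites.csv")  (hypothetical file name).
tmp <- tempfile(fileext = ".csv")
write.csv(wine_sample, tmp, row.names = FALSE)
wine <- read.csv(tmp)
unlink(tmp)

names(wine)  # the 13 column names
str(wine)    # data types and first values of each column
```

On this small sample read.csv stores the sulfur dioxide columns as integers, because the five values happen to be whole numbers; in the full dataset they are numeric.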

What is/are the main feature(s) of interest in your dataset?

##        X        fixed.acidity    volatile.acidity  citric.acid    
##  Min.   :   1   Min.   : 3.800   Min.   :0.0800   Min.   :0.0000  
##  1st Qu.:1225   1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700  
##  Median :2450   Median : 6.800   Median :0.2600   Median :0.3200  
##  Mean   :2450   Mean   : 6.855   Mean   :0.2782   Mean   :0.3342  
##  3rd Qu.:3674   3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900  
##  Max.   :4898   Max.   :14.200   Max.   :1.1000   Max.   :1.6600  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.600   Min.   :0.00900   Min.   :  2.00     
##  1st Qu.: 1.700   1st Qu.:0.03600   1st Qu.: 23.00     
##  Median : 5.200   Median :0.04300   Median : 34.00     
##  Mean   : 6.391   Mean   :0.04577   Mean   : 35.31     
##  3rd Qu.: 9.900   3rd Qu.:0.05000   3rd Qu.: 46.00     
##  Max.   :65.800   Max.   :0.34600   Max.   :289.00     
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

I don’t have the background in chemistry to point out anything remarkable about chemical attributes such as acidity. A basic familiarity with pH and alcohol content, however, suggests that the “Vinho Verde” wines in the sample are fairly acidic (the median pH is 3.18, towards the lower end of the 3-4 range quoted above), and that their alcohol content can be higher (up to 14.2%) than that of the usual wines (10-12%).

Univariate Analysis

Now let’s make a histogram of wine quality, which is our main focus of interest:

The histogram shows that the distribution of the quality scores is close to a normal distribution, although it is slightly skewed.

We can also take a look at the exact numbers to see how many data points are in each category. Looking at the data using both a histogram and a table is a good idea to get an understanding of the distribution of the values and the finer details too. For example on the histogram it’s not easy to see whether there are more wines that belong to the quality category 4 or 8. Adding the table makes this easy.

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5
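The counts above are enough to rebuild the quality column without the original file. The sketch below expands them into one score per wine and recomputes a few summary figures; it only uses numbers printed in this report.

```r
# Counts per quality score, transcribed from the table above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Expand the counts back into one score per wine.
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

length(quality)   # number of wines
table(quality)    # reproduces the counts above
mean(quality)     # compare with the summary() output
median(quality)

# Share of wines per score, in percent.
round(100 * prop.table(table(quality)), 1)

# Are there more wines of quality 8 than of quality 4?
quality_counts[["8"]] > quality_counts[["4"]]
```

The last line answers the question that is hard to read off the histogram: category 8 (175 wines) is slightly larger than category 4 (163 wines).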

Another common visualization of the distribution of the data points for one variable is using a box plot. This helps us see the distribution in another way, by highlighting the median, the quartiles and the outliers.

Now I will check all the variables in our dataset to see the possible interesting distributions.

The distributions would be easier to see with adjusted bin widths, so I am going to do that as the next step.

It is also useful to remove the outliers, so I will drop the top and bottom 1% of the values.
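The report does not show how the trimming was done, so the following is only one possible approach: a small helper (trim_extremes is a name made up for this sketch) that keeps the values between the 1st and 99th percentile, demonstrated on toy data rather than on the wine variables.

```r
# Keep only the values between the lower and upper quantile of x
# (defaults: drop the bottom 1% and the top 1%).
trim_extremes <- function(x, lower = 0.01, upper = 0.99) {
  limits <- quantile(x, probs = c(lower, upper), na.rm = TRUE)
  x[!is.na(x) & x >= limits[[1]] & x <= limits[[2]]]
}

# Toy input: the integers 1..1000 stand in for a wine variable.
toy <- 1:1000
trimmed <- trim_extremes(toy)

range(trimmed)   # smallest and largest value that survive
length(trimmed)  # number of values kept
```

Applied to a real column, the same call would look like trim_extremes(wine$chlorides), assuming the data frame is called wine.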

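Most variables are approximately normally distributed with a slight right skew: the bulk of the values sits towards the left and a longer tail extends to the right. This matches the summary above, where the mean of every chemical variable lies above its median. An interesting exception is residual sugar, which shows a bimodal distribution. Some variables are more visibly skewed, so I am going to plot them on a log10 scale to see them in more detail.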

Volatile acidity on a log10 scale shows a normal distribution, while alcohol content on a log10 scale shows an interesting, slightly bimodal distribution.
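To illustrate why a log10 scale helps with right-skewed variables, here is a small sketch on synthetic data. The values are evenly spaced quantiles of a log-normal distribution with arbitrary parameters; they are a stand-in for a skewed variable and are not fitted to the wine data.

```r
# Synthetic stand-in: evenly spaced quantiles of a log-normal
# distribution (parameters are arbitrary, not fitted to the wine data).
p <- ppoints(1000)
x <- qlnorm(p, meanlog = log(0.26), sdlog = 0.35)

# On the raw scale the long right tail pulls the mean above the median.
c(mean = mean(x), median = median(x))

# On the log10 scale the values are symmetric around their centre,
# so the mean and the median coincide.
lx <- log10(x)
c(mean = mean(lx), median = median(lx))

# Bin counts of both versions (plot = FALSE returns them without drawing).
hist(x,  breaks = 20, plot = FALSE)$counts
hist(lx, breaks = 20, plot = FALSE)$counts
```

Dropping plot = FALSE draws the two histograms, which makes the difference in shape visible directly.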

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

With a basic understanding of chemistry, I predict that pH, alcohol and residual sugar will be the most interesting variables to inspect.

Did you create any new variables from existing variables in the dataset?

At a later point in my investigation I created a “rating” variable, which transforms the quality variable into a categorical variable with only three values, to make the multivariate visualizations more effective.
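A minimal sketch of such a three-level rating, built with cut() on the quality scores reconstructed from the counts in the univariate section. The cut points (3-5, 6, 7-9) and the labels are my assumption for this sketch; the report does not state which ones were actually used.

```r
# Quality scores rebuilt from the counts in the univariate section.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Assumed cut points: 3-5 "below average", 6 "average", 7-9 "above average".
rating <- cut(quality,
              breaks = c(-Inf, 5, 6, Inf),
              labels = c("below average", "average", "above average"),
              ordered_result = TRUE)

table(rating)  # number of wines per rating category
```

With these cut points the middle category is the largest, so other boundaries may give a more balanced split.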

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Yes, as I described above, I added a log10 scale to the alcohol and volatile acidity histograms, to better see their distribution.

Bivariate Plots Section

After the univariate exploration, it’s time to look at the correlations between variables. To start, let’s plot all the variables in all their possible combinations to see which pairs look interesting for further investigation. Using ggpairs, we get an overview of all the plots.

Apparently we have too many variables for this visualization: the labels overlap a bit, and the scatterplots are not very useful because the points cover each other. Let’s investigate the variables in separate pairs. First, let’s see how the different variables correlate with wine quality, to decide which ones we should plot.

##                              [,1]
## fixed.acidity        -0.113662831
## volatile.acidity     -0.194722969
## citric.acid          -0.009209091
## residual.sugar       -0.097576829
## chlorides            -0.209934411
## free.sulfur.dioxide   0.008158067
## total.sulfur.dioxide -0.174737218
## density              -0.307123313
## pH                    0.099427246
## sulphates             0.053677877
## alcohol               0.435574715

The correlation coefficients vary between positive and negative values, but none of them is close to 1 or -1, meaning that no variable correlates very strongly with wine quality. Still, let’s investigate the three variables which show the strongest correlations: alcohol (0.44), density (-0.31) and chlorides (-0.21). Since quality takes only a few discrete values, I will use boxplots to visualize its relationship with the other variables.
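Picking the strongest correlations by eye is error-prone, so here is a short sketch that ranks the coefficients printed above by their absolute value. The numbers are transcribed from the output; the commented cor() call only indicates how such a column could be computed, and the names wine and inputs in it are assumptions.

```r
# Correlations with quality, transcribed from the output above.
# They could be computed with something like
#   cor(wine[, inputs], wine$quality)
# where wine and inputs are hypothetical names.
cor_with_quality <- c(
  fixed.acidity        = -0.113662831,
  volatile.acidity     = -0.194722969,
  citric.acid          = -0.009209091,
  residual.sugar       = -0.097576829,
  chlorides            = -0.209934411,
  free.sulfur.dioxide  =  0.008158067,
  total.sulfur.dioxide = -0.174737218,
  density              = -0.307123313,
  pH                   =  0.099427246,
  sulphates            =  0.053677877,
  alcohol              =  0.435574715
)

# Rank by strength, ignoring the sign of the correlation.
ranked <- cor_with_quality[order(abs(cor_with_quality), decreasing = TRUE)]

# The three variables most strongly correlated with quality.
round(head(ranked, 3), 2)
```

Sorting on the absolute value matters here, because two of the three strongest correlations are negative.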

The boxplot of alcohol by quality shows an interesting pattern: the median alcohol content is lowest for the average quality wine, somewhat higher for below average quality wine and considerably higher for above average quality wine. We should investigate this trend in more detail. The boxplot does not show how many wines belong to each quality category, though, so we should either check the histogram or add jitter at a later step to visualize the distribution of the wines.

The boxplot which plots quality vs density has some aspects that we should improve. The points are strongly overlapping each other, which makes it difficult to see if there are some areas which contain many points. There are also some outliers which we could remove, that would also help to see more clearly, because currently most points are plotted in a small area of the graph.

The boxplot which plots quality vs chlorides is similar to the previous one: many overlapping points, the outliers also contribute to the fact that the majority of the points are in a small area.

Improving the boxplots

Let’s add some jitter to the boxplots and also let’s remove the outliers where needed.

There are no outliers on this first boxplot, so here I only added some jitter to make the individual points more distinguishable. This makes it easier to see where most of the values lie, and the correlation is easier to identify now.
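Jitter simply adds a small random offset to each point, so that points with identical values no longer sit on top of each other. The sketch below shows the idea with base R's jitter() on ten made-up wines that share the same quality score; the amount of 0.2 is an arbitrary choice, and the plots in this report may well have been jittered differently.

```r
set.seed(42)  # only to make the sketch reproducible

# Ten wines with the same quality score overlap completely
# on the quality axis.
quality <- rep(6, 10)

# Shift each point by a random offset of at most 0.2 in either direction.
spread <- jitter(quality, amount = 0.2)
spread

# The offsets stay small, so every point still clearly belongs to score 6.
range(spread - quality)
```

Keeping the offset well below 0.5 ensures that points from neighbouring quality scores never mix.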

For the next two boxplots I also removed the bottom and top 1% of values, to get rid of the outliers.

Comparing the original boxplots with the improved ones, it’s clear how much more recognizable the individual points and the overall trend have become.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Plotting all the variables in one big view was not the most efficient way to explore the dataset: processing all the data took a long time, and the individual charts were far too small to discern meaningful details. Two correlations were nevertheless clearly visible: a positive linear correlation between residual sugar and density, and a negative linear correlation between alcohol and density. This suggests that I should look into how sugar, alcohol and density influence one another chemically, as there is clearly a strong relation between them.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

As described above.

What was the strongest relationship you found?

Inspecting the correlation coefficients, alcohol content is the only input variable that shows a notable correlation with quality (about 0.44).

Multivariate Plots Section

We can add a third variable to visualize, using color coding. The alcohol content is displayed along the X axis, the pH value on the Y axis, while the quality is using the color coding.

It is hard to distinguish between the quality levels: apparently we have too many of them, and using different shades of a single colour is not an effective way to tell them apart.

To make the visualization more efficient, we can create categories for quality, and use only the three categories. If we use three different colors, we can identify the three categories easily.

pH does not seem to be an important factor for the quality of the wine, whereas the alcohol level appears to be much more closely linked to it. It is very easy to see that the red points, which mark wines of below average quality, are on the left side of the plot, while the high quality wines (green points) are on the right.

We can explore other combinations of variables too, still using the quality as the color coded variable.

Visualizing pH values vs chlorides doesn’t reveal any strong correlations. Let’s try some other variables!

Plotting alcohol vs density reveals two patterns. We’ve seen earlier that there is a strong relation between alcohol and quality, but we see another pattern too: lower quality wines have a higher density, while higher quality wines have lower density.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The multivariate analysis reinforced what was becoming clear in the previous stage: alcohol content is the factor most strongly associated with quality, and the other factors are less relevant. For this dataset, I think the multivariate plots mainly allowed for a more striking visualization of the correlations described earlier.

Were there any interesting or surprising interactions between features?

This step didn’t reveal any new findings that were not already described in the previous examinations.


Final Plots and Summary

Plot One

Description One

The first plot is a histogram of the quality scores. It gives the viewer a good overview of the sample and shows a roughly normal, slightly skewed distribution. It peaks at 6, which is also the median quality. The categories 3 and 9 contain only a few wines (20 and 5), so these might be considered outliers.

Plot Two

Description Two

This boxplot shows the positive correlation between the alcohol content of the wine and its perceived quality: higher alcohol content goes with higher perceived quality. With a coefficient of about 0.44, the relationship is moderate rather than strong. The wines of quality 3 and 4 appear to have a somewhat higher alcohol content than this trend suggests, but the histogram (or the jitter added to the boxplot) makes it clear that these categories contain only a few wines; the majority of the wines follow the trend.

Plot Three

Description Three

This is the most stunning visualization of the relationship between alcohol content and perceived wine quality. Using color as a visual cue is a very useful way of communicating the key finding, and it is even improved by the introduction of rating categories. The human eye finds it easier to understand a visualization if it only has a limited number of colors, and adding the rating categories helped to accomplish that. The added smoothing also reveals a slight positive correlation between alcohol content and pH, meaning that higher alcohol content correlates with slightly higher pH values.

Reflection

I found the dataset very interesting, and although my background knowledge in chemistry is limited, luckily the main finding didn’t require a deep understanding of the different kinds of acids and how they affect wine quality. The findings were surprising (at least for me, as I had never heard about this correlation before) and made me curious to read more about the topic. I think this is the main purpose of EDA: to use some simple methods to point the research in further directions.

Some questions that came to my mind during the exploration:
- Are these findings relevant to other kinds of wine too, or only to the Portuguese Vinho Verde?
- Apart from the chemical factors, it seems reasonable to add other factors, mostly linked to meteorological data (rainfall and temperature over time) or to the chemical properties of the soil. I think these factors are also very important and may correlate strongly with the quality of wine.